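Supervised learning methods can solve a given problem when a large amount of labeled data is available. However, acquiring a dataset that covers all target classes typically requires manual labeling, which is expensive and time-consuming. Zero-shot learning models can classify unseen concepts by exploiting their semantic information. This study introduces image embeddings as side information for zero-shot audio classification, using a nonlinear acoustic-semantic projection. We extract semantic image representations from the Open Images dataset and evaluate the models on an audio subset of AudioSet, using semantic information from three domains: image, audio, and text. We demonstrate that image embeddings can serve as semantic information for zero-shot audio classification. The experimental results show that image and text embeddings perform similarly, both individually and combined. We also compute semantic acoustic embeddings from the test samples to provide an upper bound on performance. The results show that classification performance is highly sensitive to the semantic relationship between test and training classes, and that text and image embeddings can match the semantic acoustic embeddings when the seen and unseen classes are semantically similar.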
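The final classification step of a zero-shot model like the one described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the audio clip has already been projected into the semantic space, and the function names, the use of cosine similarity, and the toy class embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_predict(projected_audio, class_embeddings):
    """Pick the unseen class whose semantic embedding is closest.

    projected_audio: audio embedding already mapped into the semantic space
                     (the learned projection itself is not shown here).
    class_embeddings: dict mapping class name -> semantic embedding, e.g. an
                      image or text embedding of the class (assumed inputs).
    """
    return max(
        class_embeddings,
        key=lambda name: cosine(projected_audio, class_embeddings[name]),
    )


# Toy 2-D embeddings, purely illustrative.
classes = {"dog": [1.0, 0.0], "siren": [0.0, 1.0]}
print(zero_shot_predict([0.9, 0.1], classes))
```

Because the class embeddings are the only class-specific input, swapping image embeddings for text embeddings changes nothing in this step, which is what makes the comparison across domains possible.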
translated by 谷歌翻译
The vision community has explored numerous pose-guided human editing methods because of their extensive practical applications. Most of these methods still use an image-to-image formulation in which a single image is given as input to produce an edited image as output. However, the problem is ill-defined when the target pose differs significantly from the input pose. Existing methods then resort to in-painting or style transfer to handle occlusions and preserve content. In this paper, we explore the use of multiple views to reduce the amount of missing information and generate an accurate representation of the underlying human model. To fuse the knowledge from multiple viewpoints, we design a selector network that takes the pose keypoints and texture from images and generates an interpretable per-pixel selection map. After that, the encodings from a separate network (trained on a single-image human reposing task) are merged in the latent space. This enables us to generate accurate, precise, and visually coherent images for different editing tasks. We show the application of our network on two newly proposed tasks: multi-view human reposing and mix-and-match human image generation. Additionally, we study the limitations of single-view editing and the scenarios in which multi-view input provides a much better alternative.
Actively monitoring machine learning models during production operations helps ensure prediction quality and supports the detection and remediation of unexpected or undesired conditions. Monitoring models already deployed in big data environments brings the additional challenges of adding monitoring in parallel to the existing modelling workflow and controlling resource requirements. In this paper, we describe (1) a framework for monitoring machine learning models; and, (2) its implementation for a big data supply chain application. We use our implementation to study drift in model features, predictions, and performance on three real data sets. We compare hypothesis-test and information-theoretic approaches to drift detection in features and predictions using the Kolmogorov-Smirnov distance and the Bhattacharyya coefficient. Results showed that model performance was stable over the evaluation period. Features and predictions showed statistically significant drifts; however, these drifts were not linked to changes in model performance during the time of our study.
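The two drift statistics named above can be sketched in a few lines. This is a minimal standard-library illustration of the textbook definitions, not the framework's implementation; the function names, the equal-width binning, and the default of 10 bins are assumptions made for the example.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_distance(reference, current):
    """Two-sample Kolmogorov-Smirnov distance: largest gap between the
    empirical CDFs of the reference and current samples."""
    ref, cur = sorted(reference), sorted(current)
    points = sorted(set(ref) | set(cur))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in points
    )


def bhattacharyya_coefficient(reference, current, bins=10):
    """Histogram overlap on shared equal-width bins.

    Returns a value in [0, 1]; 1 means identical histograms, 0 means no overlap.
    """
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def hist(sample):
        # Clamp the maximum value into the last bin.
        counts = Counter(min(int((x - lo) / width), bins - 1) for x in sample)
        return [counts.get(b, 0) / len(sample) for b in range(bins)]

    p, q = hist(reference), hist(current)
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# A feature whose distribution has shifted completely between two windows.
print(ks_distance([1, 2, 3], [4, 5, 6]))
print(bhattacharyya_coefficient([0, 1, 2], [10, 11, 12]))
```

The KS distance supports a hypothesis test (it has a known null distribution), whereas the Bhattacharyya coefficient is a plain overlap score that needs a threshold chosen by the operator.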
Statistical language models conventionally implement representation learning based on the contextual distribution of words or other formal units, whereas information related to the logographic features of written text is often ignored, on the assumption that it can be recovered from co-occurrence statistics. On the other hand, as language models become larger and require more data to learn reliable representations, such assumptions may start to break down, especially under conditions of data sparsity. Many languages, including Chinese and Vietnamese, use logographic writing systems where surface forms are represented as a visual organization of smaller graphemic units, which often contain many semantic cues. In this paper, we present a novel study that explores the benefits of providing language models with logographic information for learning better semantic representations. We test our hypothesis on the natural language inference (NLI) task by evaluating the benefit of computing multi-modal representations that combine contextual information with glyph information. Our evaluation results in six languages with different typology and writing systems suggest significant benefits of using multi-modal embeddings in languages with logographic systems, especially for words with sparse occurrence statistics.
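Flow-based generative super-resolution (SR) models learn to produce a diverse set of feasible SR solutions, called the SR space. The diversity of SR solutions increases with the temperature ($\tau$) of the latent variables, which introduces random variations of texture among sample solutions, resulting in visual artifacts and low fidelity. In this paper, we present a simple but effective image ensembling/fusion approach to obtain a single SR image that eliminates random artifacts and improves fidelity without significantly compromising perceptual quality. We achieve this by benefiting from the diverse set of feasible photo-realistic solutions in the SR space spanned by flow models. We propose different image ensembling and fusion strategies that offer multiple paths for moving sample solutions in the SR space to more desirable destinations in the perception-distortion plane, in a controllable manner that depends on the fidelity versus perceptual quality requirements of the task at hand. Experimental results demonstrate that our image ensembling/fusion strategy achieves a more promising perception-distortion trade-off than sample SR images produced by flow models and adversarially trained models, in terms of both quantitative metrics and visual quality.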
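The simplest form of image ensembling is a pixel-wise mean over several sampled SR images, which averages out the random texture variations between samples. The sketch below shows only that baseline idea; it is an assumption for illustration and not necessarily one of the strategies the authors propose. Images are represented as nested lists of grayscale values to keep the example dependency-free.

```python
def ensemble_mean(samples):
    """Pixel-wise mean of equally sized 2-D images given as nested lists.

    samples: list of images, each a list of rows of numeric pixel values.
             In the setting above these would be SR outputs drawn from the
             flow model with different latent samples (assumed input).
    """
    n = len(samples)
    height, width = len(samples[0]), len(samples[0][0])
    return [
        [sum(img[r][c] for img in samples) / n for c in range(width)]
        for r in range(height)
    ]


# Two 1x2 "images" whose textures disagree; the mean sits between them.
print(ensemble_mean([[[0, 2]], [[2, 4]]]))
```

Averaging more samples pushes the result toward higher fidelity and smoother texture, so the number of samples acts as a crude control over the position in the perception-distortion plane.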
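Prioritized Experience Replay (PER), a widely studied deep reinforcement learning (RL) technique, lets agents learn from transitions sampled with probability proportional to their temporal-difference (TD) error. Although PER has been shown to be one of the most crucial components for the overall performance of deep RL methods in discrete action domains, many empirical studies indicate that it performs considerably worse when combined with actor-critic algorithms in continuous control. We show theoretically that actor networks cannot be trained effectively with transitions that have large TD errors. As a result, the approximate policy gradient computed under the Q-network diverges from the actual gradient computed under the optimal Q-function. Motivated by this, we introduce a novel experience replay sampling framework for actor-critic methods, which also addresses stability issues and recent findings on the causes of PER's poor empirical performance. The introduced algorithm suggests a new branch of improvements to PER and enables effective and efficient training of both the actor and critic networks. An extensive set of experiments verifies our theoretical claims and demonstrates that the introduced method significantly outperforms competing approaches and obtains state-of-the-art results compared with standard off-policy actor-critic algorithms.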
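For reference, the standard proportional PER sampling rule that this abstract takes as its starting point can be sketched as below. This shows the baseline technique only, not the new sampling framework the authors introduce; the function names and the defaults `alpha=0.6` and `eps=1e-6` are assumptions chosen for the example.

```python
import random


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha.

    alpha = 0 recovers uniform sampling; larger alpha prioritizes more strongly.
    eps keeps transitions with zero TD error sampleable.
    """
    priorities = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size, alpha=0.6, rng=None):
    """Draw a batch of replay-buffer indices (with replacement) under PER."""
    rng = rng or random.Random()
    probs = per_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)


td = [0.1, 2.0, 0.5]
print(per_probabilities(td))
print(sample_indices(td, batch_size=4, rng=random.Random(0)))
```

Under this rule the transition with the largest TD error dominates the batch, which is exactly the regime the abstract argues is harmful for training the actor.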
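Estimating 3D human pose and shape from 2D images is a crucial yet challenging task. While prior methods with model-based representations can perform reasonably well on whole-body images, they often fail when parts of the body are occluded or outside the frame. Moreover, their results usually do not capture human silhouettes faithfully because of the limited representational power of their deformable models (e.g., representing only the naked body). An alternative approach is to estimate dense vertices of a predefined template body in image space. Such representations localize vertices within an image effectively but cannot handle out-of-frame body parts. In this work, we learn dense human body estimation that is robust to partial observations. We explicitly model the visibility of human joints and vertices along the x, y, and z axes. Visibility along the x and y axes helps distinguish out-of-frame cases, and visibility along the depth axis corresponds to occlusions (either self-occlusions or occlusions by other objects). We obtain pseudo ground-truth visibility labels from dense UV correspondences and train a neural network to predict visibility along with 3D coordinates. We show that visibility can serve as 1) an additional signal to resolve depth-ordering ambiguities of self-occluded vertices and 2) a regularization term when fitting a human body model to the predictions. Extensive experiments on multiple 3D human datasets demonstrate that visibility modeling significantly improves the accuracy of human body estimation, especially for partial-body cases. Our project page with code is at: https://github.com/chhankyao/visdb.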
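The x- and y-axis visibility described above amounts to an in-frame test on each projected point. The sketch below illustrates only that part, under the assumption that visibility is a per-axis boolean derived from pixel coordinates and image size; the function name is hypothetical, and depth-axis visibility (which the abstract derives from dense UV correspondences) is not covered.

```python
def xy_visibility(points, width, height):
    """Return (x_visible, y_visible) flags for each (x, y) pixel coordinate.

    A point is visible along an axis when its coordinate lies inside the
    image extent on that axis, so a joint can be in-frame horizontally yet
    out-of-frame vertically.
    """
    return [(0 <= x < width, 0 <= y < height) for x, y in points]


# Three joints in a 100x200 image: inside, left of frame, below frame.
print(xy_visibility([(10, 20), (-5, 20), (10, 300)], width=100, height=200))
```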
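Dental age is one of the most reliable indicators for determining an individual's age. Using dental panoramic radiography (DPR) images, physicians and pathologists in forensic science try to establish the chronological age of individuals without valid legal records or of registered patients. The methods currently used in practice demand intensive labor, time, and qualified experts. In the field of medical image processing, the development of deep learning algorithms has improved the sensitivity with which true values are predicted while reducing image processing time. This study proposes an automated approach to estimating the forensic age of individuals aged 8 to 68 using 1,332 DPR images. Initially, experimental analyses were performed with transfer-learning-based models, including InceptionV3, DenseNet201, EfficientNetB4, MobileNetV2, VGG16, and ResNet50V2; the best-performing model, InceptionV3, was then modified, and a new neural network model was developed. Reducing the number of parameters in the developed model architecture resulted in faster and more accurate dental age estimation. The performance metrics of the results were as follows: a mean absolute error (MAE) of 3.13, a root mean square error (RMSE) of 4.77, and a correlation coefficient $R^2$ of 87%. The new model could plausibly serve as a dependable and practical auxiliary tool in forensic science and dental medicine.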
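The three reported metrics can be computed from true and predicted ages as sketched below. This is a minimal illustration using made-up ages, not the study's data or evaluation code. It assumes that $R^2$ denotes the coefficient of determination, which is the usual reading although the abstract calls it a correlation coefficient.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values.

    R^2 here is the coefficient of determination, 1 - SS_res / SS_tot
    (an assumption about what the abstract reports).
    """
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical ages in years, for illustration only.
mae, rmse, r2 = regression_metrics([10, 20, 30], [12, 18, 30])
print(mae, rmse, r2)
```

RMSE is never smaller than MAE, and a large gap between the two (as between 3.13 and 4.77) indicates that a few predictions have much larger errors than the rest.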